Me: Chris Hausler, Snr Data Engineer @ Zendesk

Now: Pandas

Python Data Analysis Library (pandas)

  • open source
  • BSD-licensed library
  • high-performance
  • easy-to-use... sort of

Creates something similar to R's data frames... but better

I think it's great, but I'm still a bit clumsy with it... also the docs are still a little hit and miss

Some imports


In [8]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [17]:
import numpy as np
import pandas as pd
import pylab as plt
import matplotlib 
import seaborn as sns
%matplotlib inline
pd.__version__


Out[17]:
'0.14.1'

pandas has two main data structures: Series and DataFrame

Series - like a one-dimensional array, but better


In [10]:
values = [5,3,4,8,2,9]
vals = pd.Series(values)
vals


Out[10]:
0    5
1    3
2    4
3    8
4    2
5    9
dtype: int64

Each value is now associated with an index.

The index itself is an object of class Index and can be manipulated directly.


In [11]:
vals.index


Out[11]:
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [12]:
vals.values


Out[12]:
array([5, 3, 4, 8, 2, 9])

In [13]:
vals * 2.5


Out[13]:
0    12.5
1     7.5
2    10.0
3    20.0
4     5.0
5    22.5
dtype: float64

We can use text-based indexes too


In [14]:
vals2 = pd.Series(values, index=['tom','sally','jeff','george','pablo','florence'])
vals2


Out[14]:
tom         5
sally       3
jeff        4
george      8
pablo       2
florence    9
dtype: int64

And use these to get the data we want


In [18]:
vals2[['florence','tom']]


Out[18]:
florence    9
tom         5
dtype: int64

In [19]:
vals2[['florence','tom','kate']]


Out[19]:
florence     9
tom          5
kate       NaN
dtype: float64

Dealing with missing values


In [20]:
vals3 = vals2[['tom','sally','pablo','florence','ricky','katrin']]
vals3


Out[20]:
tom          5
sally        3
pablo        2
florence     9
ricky      NaN
katrin     NaN
dtype: float64

1. Get rid of them


In [22]:
vals3.dropna()


Out[22]:
tom         5
sally       3
pablo       2
florence    9
dtype: float64

2. Fill them with something


In [23]:
vals3.fillna(0)


Out[23]:
tom         5
sally       3
pablo       2
florence    9
ricky       0
katrin      0
dtype: float64

3. Fill them with a calculated value


In [24]:
vals3.fillna(vals3.mean())


Out[24]:
tom         5.00
sally       3.00
pablo       2.00
florence    9.00
ricky       4.75
katrin      4.75
dtype: float64

4. Use a function like forward fill


In [25]:
vals3.fillna(method='ffill')


Out[25]:
tom         5
sally       3
pablo       2
florence    9
ricky       9
katrin      9
dtype: float64

We also have a handy way to get a picture of our data


In [26]:
vals3.describe()


Out[26]:
count    4.000000
mean     4.750000
std      3.095696
min      2.000000
25%      2.750000
50%      4.000000
75%      6.000000
max      9.000000
dtype: float64

DataFrame

Like a 2D array... with bells and whistles


In [28]:
# give vals the same text index, and swap 'ricky' for 'billy' in vals3
# so the two Series only partially overlap
vals.index = pd.Index(['tom','sally','pablo','florence','ricky','katrin'])
vals3 = vals3[['tom','sally','pablo','florence','billy','katrin']]

In [29]:
# create a dataframe
dat = pd.DataFrame({'orig':vals,'new':vals3})
dat


Out[29]:
          new  orig
billy     NaN   NaN
florence    9     8
katrin    NaN     9
pablo       2     4
ricky     NaN     2
sally       3     3
tom         5     5

Check for nulls


In [30]:
dat.isnull()


Out[30]:
            new   orig
billy      True   True
florence  False  False
katrin     True  False
pablo     False  False
ricky      True  False
sally     False  False
tom       False  False

Drop rows with nulls


In [31]:
dat.dropna()


Out[31]:
          new  orig
florence    9     8
pablo       2     4
sally       3     3
tom         5     5

Time series with pandas DataFrames

... a winning combination

Data from Google Trends... what correlates (positively and negatively) with the search term 'hipster'?

Read the hipster correlations from a CSV file

Pandas supports many file formats for read and write, including:

  • csv
  • json
  • pickle
  • the clipboard
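
For example, a round trip through JSON and pickle looks like this (a minimal sketch using the dat frame from above; the file names are made up):


In [ ]:
# hypothetical file names, just for illustration
dat.to_json('dat.json')
dat_from_json = pd.read_json('dat.json')

dat.to_pickle('dat.pkl')
dat_from_pickle = pd.read_pickle('dat.pkl')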

In [33]:
hipster = pd.read_csv('hipster.csv')
hipster[:10]


Out[33]:
       Date  hipster  modcloth  gumtree perth
0   4/01/04   -0.976    -0.817         -0.844
1  11/01/04   -0.816    -0.817         -0.844
2  18/01/04   -0.837    -0.817         -0.844
3  25/01/04   -0.976    -0.817         -0.844
4   1/02/04   -0.722    -0.817         -0.844
5   8/02/04   -0.795    -0.817         -0.844
6  15/02/04   -0.723    -0.817         -0.844
7  22/02/04   -0.713    -0.817         -0.844
8  29/02/04   -0.786    -0.817         -0.844
9   7/03/04   -0.675    -0.817         -0.844

Set the index to a datetime


In [34]:
hipster = hipster.set_index(pd.DatetimeIndex(hipster.pop('Date')))
hipster[:10]


Out[34]:
            hipster  modcloth  gumtree perth
2004-04-01   -0.976    -0.817         -0.844
2004-11-01   -0.816    -0.817         -0.844
2004-01-18   -0.837    -0.817         -0.844
2004-01-25   -0.976    -0.817         -0.844
2004-01-02   -0.722    -0.817         -0.844
2004-08-02   -0.795    -0.817         -0.844
2004-02-15   -0.723    -0.817         -0.844
2004-02-22   -0.713    -0.817         -0.844
2004-02-29   -0.786    -0.817         -0.844
2004-07-03   -0.675    -0.817         -0.844
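
One caveat: ambiguous strings like 4/01/04 get parsed month-first (hence 2004-04-01 above), while 18/01/04 can only be day-first, so this index isn't in true chronological order. A sketch that re-reads the file, parses day-first explicitly, and sorts, assuming the CSV dates really are day/month/year:


In [ ]:
hipster = pd.read_csv('hipster.csv')
# dayfirst=True makes 4/01/04 parse as 2004-01-04
dates = pd.to_datetime(hipster.pop('Date'), dayfirst=True)
hipster = hipster.set_index(pd.DatetimeIndex(dates)).sort_index()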

Now load the anti-Hipster data


In [36]:
not_hipster = pd.read_csv('negative_hipster.csv')
not_hipster = not_hipster.set_index(pd.DatetimeIndex(not_hipster.pop('Date')))

In [37]:
not_hipster[:10]


Out[37]:
            yellow pages  babelfish  sony ericsson  melway
2004-04-01         1.341      0.090          0.695   0.983
2004-11-01         1.239      0.256          0.846   0.714
2004-01-18         1.022     -0.023          0.558   0.694
2004-01-25         0.923      0.188          0.645   1.104
2004-01-02         0.904     -0.009          0.513   0.767
2004-08-02         0.786      0.338          0.694   1.131
2004-02-15         0.729      0.360          0.695   0.985
2004-02-22         0.537     -0.070          0.613   0.859
2004-02-29         0.534      0.273          0.447   1.286
2004-07-03         0.229      0.595          0.898   0.810

Check the values of one column


In [38]:
hipster.hipster.head()


Out[38]:
2004-04-01   -0.976
2004-11-01   -0.816
2004-01-18   -0.837
2004-01-25   -0.976
2004-01-02   -0.722
Name: hipster, dtype: float64

Check another, but get the values as a numpy.ndarray


In [39]:
hipster['gumtree perth'].values[:20]


Out[39]:
array([-0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844,
       -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844,
       -0.844, -0.844, -0.844, -0.844])

View the data types; they don't need to be homogeneous


In [40]:
hipster.dtypes


Out[40]:
hipster          float64
modcloth         float64
gumtree perth    float64
dtype: object

Joins on indexes are easy!


In [41]:
trend = hipster.join(not_hipster, how='inner')
trend.head()


Out[41]:
            hipster  modcloth  gumtree perth  yellow pages  babelfish  sony ericsson  melway
2004-04-01   -0.976    -0.817         -0.844         1.341      0.090          0.695   0.983
2004-11-01   -0.816    -0.817         -0.844         1.239      0.256          0.846   0.714
2004-01-18   -0.837    -0.817         -0.844         1.022     -0.023          0.558   0.694
2004-01-25   -0.976    -0.817         -0.844         0.923      0.188          0.645   1.104
2004-01-02   -0.722    -0.817         -0.844         0.904     -0.009          0.513   0.767
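
join defaults to how='left', keeping every row of the left frame; an outer join would keep dates present in either frame (a sketch with the same join API):


In [ ]:
trend_outer = hipster.join(not_hipster, how='outer')
trend_outer.shape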

We can check the column names and values


In [42]:
trend.columns


Out[42]:
Index([u'hipster', u'modcloth', u'gumtree perth', u'yellow pages', u'babelfish', u'sony ericsson', u'melway'], dtype='object')

In [43]:
trend.values


Out[43]:
array([[-0.976, -0.817, -0.844, ...,  0.09 ,  0.695,  0.983],
       [-0.816, -0.817, -0.844, ...,  0.256,  0.846,  0.714],
       [-0.837, -0.817, -0.844, ..., -0.023,  0.558,  0.694],
       ..., 
       [ 1.142,  1.175,  1.394, ..., -1.692, -1.707, -1.171],
       [ 1.187,  1.221,  1.403, ..., -1.654, -1.707, -1.16 ],
       [ 1.514,  1.216,  1.365, ..., -1.664, -1.697, -1.171]])

Filtering on date ranges is simple


In [44]:
trend['2012-01-01':].head()


Out[44]:
            hipster  modcloth  gumtree perth  yellow pages  babelfish  sony ericsson  melway
2012-01-01    1.411     1.192          1.774        -1.077     -1.132         -0.819  -1.058
2012-08-01    1.513     1.111          1.579        -0.995     -1.128         -0.889  -1.069
2012-01-15    1.523     1.427          1.613        -1.027     -1.182         -0.991  -1.074
2012-01-22    1.600     1.490          1.514        -1.140     -1.197         -0.975  -1.106
2012-01-29    1.459     1.561          1.511        -1.046     -1.189         -1.027  -1.079

In [45]:
trend['2012-01-01': '2013-01-01'].tail(3)


Out[45]:
            hipster  modcloth  gumtree perth  yellow pages  babelfish  sony ericsson  melway
2012-12-16    1.645     1.175          1.407        -1.433     -1.523         -1.545  -1.192
2012-12-23    1.591     1.695          1.625        -1.698     -1.540         -1.499  -1.207
2012-12-30    1.596     1.515          1.868        -1.515     -1.531         -1.535  -1.207

We can also grab a single date, or a subset of columns


In [46]:
trend.ix['2012-01-01', ['hipster', 'modcloth']]


Out[46]:
            hipster  modcloth
2012-01-01    1.411     1.192
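
As an aside, .ix was deprecated in later pandas releases; .loc does the same label-based lookup and is the safer habit (a sketch):


In [ ]:
trend.loc['2012-01-01', ['hipster', 'modcloth']]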

Or do some boolean filtering


In [48]:
trend[trend.melway < 0].head()


Out[48]:
            hipster  modcloth  gumtree perth  yellow pages  babelfish  sony ericsson  melway
2008-02-11   -0.613    -0.759         -0.675         0.342      0.402          0.487  -0.054
2008-12-21   -0.626    -0.700         -0.650         0.103      0.305          0.973  -0.082
2008-12-28   -0.721    -0.749         -0.617         0.371      0.160          0.744  -0.193
2009-04-01   -0.713    -0.726         -0.602         0.815      0.256          0.618  -0.070
2009-11-01   -0.706    -0.745         -0.602         0.771      0.058          0.596  -0.034
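
Conditions combine with the element-wise & and | operators, with parentheses around each clause, e.g.:


In [ ]:
# weeks where melway trends down while hipster trends up
trend[(trend.melway < 0) & (trend.hipster > 0)].head()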

Plotting is built in, and handles dates more gracefully than raw matplotlib


In [49]:
_ = trend.plot(figsize=(10, 6))
_ = plt.legend(loc='best', ncol=2)


We can also do it for a single column


In [50]:
_ = trend.hipster.cumsum().plot()


Or split the columns out to subplots


In [60]:
axs = trend.plot(subplots=True, figsize=(10, 10))


Resampling data is also straightforward.


In [52]:
# resample by month
trend.resample('M', how='mean').head()


Out[52]:
             hipster  modcloth  gumtree perth  yellow pages  babelfish  sony ericsson  melway
2004-01-31  -0.84550    -0.817         -0.844        0.7915    0.16075        0.60925  0.8645
2004-02-29  -0.76725    -0.817         -0.844        0.5880    0.23400        0.60125  0.9355
2004-03-31  -0.81675    -0.817         -0.844        0.4460    0.40725        0.78550  0.8470
2004-04-30  -0.80260    -0.817         -0.844        0.6992    0.44000        0.62420  0.9614
2004-05-31  -0.77000    -0.817         -0.844        0.4052    0.53860        1.07800  0.7422

but one can also resample by business day, week, month, quarter, year and a bunch of other rules


In [67]:
trend.resample('M', how='mean').hipster.dropna().plot()


Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x109bc9950>
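
For instance, a quarterly mean is just a different rule string (a sketch using the same resample API as above):


In [ ]:
trend.resample('Q', how='mean').head()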

Other fancy plots include a scatter matrix with kernel density estimates (KDE) on the diagonal


In [68]:
# look at the relations
_ = pd.scatter_matrix(trend, figsize=(12,8), diagonal='kde')


Titanic: Machine Learning from Disaster (kaggle.com)

Load the data, explore it and learn from it


In [69]:
df = pd.read_csv('train.csv', header=0)

In [70]:
df.head()


Out[70]:
   PassengerId  Survived  Pclass                                               Name     Sex  Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male   22      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female   26      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male   35      0      0            373450   8.0500   NaN        S

Let's look at the data types here (this time they're heterogeneous)


In [71]:
df.dtypes


Out[71]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

We can also get a more verbose summary


In [72]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)

DataFrames can be grouped, like in SQL (it sucked to be a young male on the Titanic)


In [73]:
df_grouped = df.groupby(['Pclass', 'Sex'])

In [74]:
df_grouped[['Age', 'Survived']].mean()


Out[74]:
                     Age  Survived
Pclass Sex
1      female  34.611765  0.968085
       male    41.281386  0.368852
2      female  28.722973  0.921053
       male    30.740707  0.157407
3      female  21.750000  0.500000
       male    26.507589  0.135447
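
Groups aren't limited to a single statistic; .agg takes a list of functions (a sketch):


In [ ]:
df_grouped['Age'].agg(['mean', 'median', 'count'])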

Histograms are straightforward


In [75]:
ax = df['Age'].dropna().hist(bins=20, range=(0,100), alpha = .5)
ax.set_xlabel('Age')
ax.set_ylabel('Passenger Count')


Out[75]:
<matplotlib.text.Text at 0x11019a710>

So are boxplots


In [101]:
bp = df.boxplot(column='Age', by='Pclass', grid=False)
# for i in set(df.Pclass):
#     y = df.Age[df.Pclass==i].dropna()
#     # Add some random "jitter" to the x-axis
#     x = np.random.normal(i, 0.04, size=len(y))
#     plt.plot(x, y, 'r.', alpha=0.2)


If we want to do some learning on this data... let's convert gender to a binary numeric


In [77]:
df['isFemale'] = df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
df[['Sex','isFemale']].head()


Out[77]:
      Sex  isFemale
0    male         0
1  female         1
2  female         1
3  female         1
4    male         0

Find non-numeric columns so we can drop them later


In [78]:
drop_cols = df.columns[df.dtypes.map(lambda x: x=='object')]
drop_cols


Out[78]:
Index([u'Name', u'Sex', u'Ticket', u'Cabin', u'Embarked'], dtype='object')

In [79]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
isFemale       891 non-null int64
dtypes: float64(2), int64(6), object(5)

Set up our data to learn from


In [80]:
X = pd.DataFrame(df[[c for c in df.columns if c != 'Survived']])
X = X.drop(drop_cols, axis=1) 
X = X.drop('PassengerId', axis=1)
y = df.Survived
print X.head()


   Pclass  Age  SibSp  Parch     Fare  isFemale
0       3   22      1      0   7.2500         0
1       1   38      1      0  71.2833         1
2       3   26      0      0   7.9250         1
3       1   35      1      0  53.1000         1
4       3   35      0      0   8.0500         0

Have a quick look at the class distribution


In [81]:
y.groupby(y.values).count()


Out[81]:
0    549
1    342
Name: Survived, dtype: int64
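
The same class balance comes back from a single call to value_counts:


In [ ]:
y.value_counts()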

and fill in some NaNs for age


In [82]:
X['Age'] = X.Age.fillna(X.Age.median())
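
A fancier option (a sketch, not what's applied above) is to fill with the median age of each passenger class rather than the global median; age_filled is a hypothetical name:


In [ ]:
# class-wise median fill via groupby + transform
age_filled = df.groupby('Pclass')['Age'].transform(lambda s: s.fillna(s.median()))
age_filled.head()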

scikit-learn

Machine Learning in Python

  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license

What is Machine Learning?

"Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data" (thanks, Wikipedia)

Prediction with scikit-learn is easy - who will survive?


In [83]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score as acc

In [84]:
# create our classifier
clf = LogisticRegression()
# fit it to the data
clf.fit(X, y)
# and predict
preds = clf.predict(X)
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)
print 'Not too bad'


Accuracy Score: 0.80
Not too bad

Cross-validation gives a fairer performance estimate


In [85]:
from sklearn.cross_validation import KFold

In [86]:
cv = KFold(n=len(y), n_folds=5, shuffle=True)
preds = np.zeros_like(y)
for train, test in cv:
    clf = LogisticRegression()
    clf.fit(X.ix[train], y.ix[train])
    preds[test] = clf.predict(X.ix[test])
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)


Accuracy Score: 0.79

And cross-validation can be done more easily


In [87]:
# scikit-learn can actually take care of this for us
from sklearn.cross_validation import cross_val_score

clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

print scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


[ 0.78212291  0.77094972  0.78651685  0.75842697  0.81920904]
Accuracy: 0.78 (+/- 0.04)

Dealing with categorical data


In [88]:
df.Embarked.head()


Out[88]:
0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object

In [89]:
set(df.Embarked.fillna('O'))


Out[89]:
{'C', 'O', 'Q', 'S'}

Use the LabelEncoder


In [90]:
from sklearn import preprocessing
df.Embarked = df.Embarked.fillna('O')
le = preprocessing.LabelEncoder()
le.fit(df.Embarked.values)
le.classes_


Out[90]:
array(['C', 'O', 'Q', 'S'], dtype=object)

In [91]:
X['Embarked'] = le.transform(df.Embarked.values)
X.Embarked.head()


Out[91]:
0    3
1    0
2    3
3    3
4    3
Name: Embarked, dtype: int64
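
Note that LabelEncoder imposes an arbitrary ordering on the ports; pandas' own get_dummies one-hot encodes instead (a sketch; X2 is a hypothetical name, the label-encoded X above is what's used below):


In [ ]:
dummies = pd.get_dummies(df.Embarked, prefix='Embarked')
X2 = X.drop('Embarked', axis=1).join(dummies)
X2.head()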

Tuning classifier parameters


In [94]:
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    # C is LogisticRegression's inverse regularisation strength
    clf = LogisticRegression(C=C, penalty='l1')
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')    
    print("C: {:3.3f}\tAccuracy: {:.2f} (+/- {:.2f})"
          .format(C, scores.mean(), scores.std() * 2))


C: 0.001	Accuracy: 0.66 (+/- 0.08)
C: 0.010	Accuracy: 0.67 (+/- 0.08)
C: 0.100	Accuracy: 0.79 (+/- 0.04)
C: 1.000	Accuracy: 0.79 (+/- 0.04)
C: 10.000	Accuracy: 0.78 (+/- 0.04)
C: 100.000	Accuracy: 0.78 (+/- 0.03)
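
scikit-learn can also automate this kind of search; GridSearchCV tries each setting under cross-validation (a sketch using this era's sklearn.grid_search module):


In [ ]:
from sklearn.grid_search import GridSearchCV

grid = GridSearchCV(LogisticRegression(penalty='l1'),
                    param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]},
                    cv=5, scoring='accuracy')
grid.fit(X, y)
print grid.best_params_, grid.best_score_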

Comparing classifiers is easy


In [95]:
# standardise the data to zero mean, unit variance
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

In [96]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.lda import LDA
from sklearn.qda import QDA

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Random Forest", "AdaBoost", "Naive Bayes", "LDA",
         "QDA", "Logistic Regression"]
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GaussianNB(),
    LDA(),
    QDA(),
    LogisticRegression(class_weight='auto')]

In [97]:
# fit each classifier and find the mean performance
res = []
for name, clf in zip(names, classifiers):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    res.append(scores.mean())

In [98]:
import prettyplotlib as ppl
res = np.array(res)
names = np.array(names)
idx = np.argsort(res)[::-1]
fig, ax = plt.subplots(1, figsize=(14, 6))
ppl.bar(ax, np.arange(len(res)), res[idx], annotate=True,
        xticklabels=names[idx], grid='y')
plt.xticks(rotation=30)
_ = ax.set_ylim(res.min() * 0.95, res.max() * 1.05)


Models can be pickled


In [ ]:
# models can be saved
import pickle
s = pickle.dumps(clf)
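
... and loaded back to predict later (a sketch; the classifier needs to be fitted before pickling for predict to work):


In [ ]:
clf = LogisticRegression().fit(X, y)
clf2 = pickle.loads(pickle.dumps(clf))
clf2.predict(X[:5])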

And there is a whole lot more scikit-learn can do...

  • supervised learning
  • model evaluation
  • unsupervised learning
  • feature selection
  • feature extraction

by Andreas Mueller